It is advised to use the RStudio IDE and an R version of at least 3.5.2, due to the installation process. The script will automatically install all required packages if they are not already present. The relative positioning of all files and folders should not be tampered with, including output files; work on copies instead. For more information or any questions concerning the given resources, please consider gathering additional insight or getting in contact through https://jorgeserras.com/
The current study aims to provide further research and resources on the detection and modeling of anomalous behavior in multivariate temporal data. A complete system, known as MultivariatE Time sEries OutlieR (METEOR) (Serras 2018), is built to handle multivariate time series (MTS) from pre-processing to score analysis.
All data, including time stamps, are first pre-processed and discretized to ensure the correct employment of the proposed methods. The developed system is based on Dynamic Bayesian Networks (DBN). METEOR is made available for anomaly detection and followed step-by-step. A web application and tutorial are additionally made accessible (Serras 2018) for any endeavor, together with supplementary information.
Furthermore, the generated output files are available in the folder /resources/output_files, where several subsets and transformations of the input dataset are present. Additionally, the folder /DBN contains the output files concerning outlier detection using the DBN (METEOR) approach.
Outliers can be defined as observations suspected of not having been generated by the data’s underlying processes. Many applications require a way of identifying interesting or unusual patterns in multivariate time series (MTS); however, most outlier detection methods focus solely on univariate series, providing analysts with strenuous solutions. We propose a complete outlier detection system, covering the full pipeline from pre-processing onward, that adopts a dynamic Bayesian network modeling algorithm. The latter encodes optimal inter- and intra-time-slice connectivity of transition networks capable of capturing conditional dependencies in MTS datasets. A sliding window mechanism is employed to gradually score each MTS transition given the model. Simultaneously, complete MTS are evaluated. Two score-analysis strategies are studied to automatically establish a boundary classifying anomalous data. The proposed approach is first validated through simulated data, demonstrating the system’s performance. A comparison with an assembled probabilistic suffix tree method is available, displaying the leverage of the proposed multivariate approach. Further experimentation is made on real data, by uncovering anomalies in distinct scenarios such as electrocardiogram series, mortality rate data and written pen digits. The developed system proved beneficial in capturing unusual data resulting from temporal contexts, being suitable for any MTS scenario. A widely accessible web application employing the complete system is made available jointly with a tutorial.
To begin, the input data is loaded. If the dataset is not in the correct format, please check the examples and tools made available to ensure correct use. The continuous MTS input dataset should be in the /resources folder. The code below must be changed accordingly, and the number of variables n_variables specified. If the data is already discretized, skip the SAX Discretization phase.
####
original_continuous_data <- as.data.frame(read.csv(file = paste(read_path, "/continuous_example_data.csv", sep = ""), header=TRUE))
n_variables <- 2 # For the SAX plot function, 1 means the subjects are univariate
In the first phase, a method known as Symbolic Aggregate approXimation (SAX) (Lin et al. 2003) is applied to discretize time series (TS) into strings. It transforms a univariate TS of length n, after normalization, into a string of arbitrary length w. The latter can satisfy w << n if Piecewise Aggregate Approximation (PAA) is applied, ensuring dimensionality reduction of the normalized TS when excessive measurements are present. An alphabet size must be chosen to determine the number of symbols in the final discretized sequences. SAX assumes that normalized TS follow a Gaussian distribution with zero mean and unit standard deviation. For additional information, please see (Lin et al. 2003).
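As a minimal illustration of these two steps (not the package’s internal implementation), the SAX breakpoints are simply quantiles of N(0,1) splitting the real line into equiprobable regions, and PAA averages the normalized series over w equal-width segments. The helper names below are hypothetical:

```r
# Hypothetical helpers illustrating SAX/PAA (not the package internals).

# SAX breakpoints: quantiles of N(0,1) splitting the real line into
# `alphabet_size` equiprobable regions.
sax_breakpoints <- function(alphabet_size) {
  qnorm(seq_len(alphabet_size - 1) / alphabet_size)
}
sax_breakpoints(3)  # two breakpoints, approx. -0.43 and 0.43

# PAA: reduce a normalized series of length n to w segment means.
paa <- function(ts, w) {
  n <- length(ts)
  sapply(seq_len(w), function(i) {
    idx <- (floor((i - 1) * n / w) + 1):floor(i * n / w)
    mean(ts[idx])
  })
}
paa(c(1, 3, 2, 4, 6, 8), 3)  # -> 2 3 7
```

Each symbol of the final string is then obtained by mapping every PAA segment mean to the region of the breakpoints it falls into.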
SAX does not allow missing values; imputation of missing values should be considered as future work. The SAX discretization is the same as that available in the web app: https://jorgeserras.shinyapps.io/outlierdetection/
Parameters can be changed in the code below, allowing adjustment of the alphabet size, the PAA reduction size, and an auxiliary function for the plot. Feel free to experiment until the results are satisfactory.
ATTENTION: It is worth noting that since each TS is normalized, the offset is lost in the process. This means that subjects with extreme values, for example, may be brought to zero and later classified as normal. Visual inspection, or an additional step before SAX, is advised to prevent such scenarios.
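One possible pre-SAX safeguard, sketched below with a hypothetical helper, is to flag subjects whose raw (pre-normalization) mean is extreme before discretizing, using the same fence idea as Tukey’s rule applied later in the score-analysis:

```r
# Hypothetical pre-SAX check: z-normalization discards each subject's
# offset, so flag subjects whose raw mean is extreme beforehand.
flag_extreme_offsets <- function(subject_means, k = 1.5) {
  q <- quantile(subject_means, c(0.25, 0.75), names = FALSE)
  iqr <- q[2] - q[1]
  which(subject_means < q[1] - k * iqr | subject_means > q[2] + k * iqr)
}
flag_extreme_offsets(c(0, 0.1, -0.1, 0.05, 10))  # -> 5
```

Flagged subjects can then be inspected visually or handled separately before the discretization step.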
### Discretization:
panel_data <- as.data.frame(parseToPanel(original_continuous_data))[,1:(n_variables+2)]
### SAX parameters:
alphabet_size <- 3
paa_size <- as.integer(max(panel_data$timestamp)) # Setting paa_size to the number of time slices prevents dimensionality reduction. For univariate data this equals ncol() - 1 (the -1 excludes the "subject_id" column); for multivariate data, paa_size must equal the number of time slices to avoid performing PAA reduction
### SAX parameters: alphabet_size and paa_size, which gives the new reduced number of time slices for each TS
discrete_outputs <- discretize(panel_data = panel_data, alphabet_size = alphabet_size, paa_size = paa_size)
### Retrieve solely the discretized dataset:
discrete_data <- discrete_outputs[[1]] # The variable names differ here, but they are easily recovered later if necessary
normalized_data <- discrete_outputs[[2]]
timestamp <- discrete_outputs[[3]]
subject_id_data <- data.frame(subject_id = discrete_data$subject_id)
### Save in file:
write.csv(discrete_data, file = paste(write_path, "/discrete_data.csv", sep = ""), row.names = FALSE, quote = TRUE)
The results of the discretization can be analyzed in the figure below, which displays the normalized data before performing PAA reduction (if considered). It shows the mean and standard deviation of each time slice of the dataset. Every TS displayed is normalized with mean = 0 and std = 1. If the number of variables is higher than 1, several curves occur, one for each variable of the MTS. The breakpoints are also visible (pink), delimiting the regions of each symbol of the chosen alphabet. An alphabet size of 3, for example, displays 2 breakpoints separating the entire y domain into 3 equiprobable regions according to a Normal distribution N(0,1).
Results of SAX pre-processing including PAA reduction (if considered). Parameters can be changed below to specify which subjects to observe.
kable(data.frame(discrete_data[1:7,]),caption="SAX discretized subjects")
| subject_id | V1__1 | V2__1 | V1__2 | V2__2 | V1__3 | V2__3 | V1__4 | V2__4 | V1__5 | V2__5 | V1__6 | V2__6 | V1__7 | V2__7 | V1__8 | V2__8 | V1__9 | V2__9 | V1__10 | V2__10 | V1__11 | V2__11 | V1__12 | V2__12 | V1__13 | V2__13 | V1__14 | V2__14 | V1__15 | V2__15 | V1__16 | V2__16 | V1__17 | V2__17 | V1__18 | V2__18 | V1__19 | V2__19 | V1__20 | V2__20 | V1__21 | V2__21 | V1__22 | V2__22 | V1__23 | V2__23 | V1__24 | V2__24 | V1__25 | V2__25 | V1__26 | V2__26 | V1__27 | V2__27 | V1__28 | V2__28 | V1__29 | V2__29 | V1__30 | V2__30 | V1__31 | V2__31 | V1__32 | V2__32 | V1__33 | V2__33 | V1__34 | V2__34 | V1__35 | V2__35 | V1__36 | V2__36 | V1__37 | V2__37 | V1__38 | V2__38 | V1__39 | V2__39 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | c | b | c | a | c | b | c | c | c | c | c | c | b | c | c | c | b | c | c | c | b | c | b | b | b | b | b | b | b | b | b | b | a | b | b | b | a | b | a | b | a | b | a | a | a | a | a | a | a | a | a | a | a | a | a | a | a | a | a | a | a | a | b | a | b | a | b | a | b | b | b | b | c | b | c | b | c | b |
| 1 | c | b | c | a | c | a | c | b | c | c | b | c | b | c | b | c | b | c | b | c | b | c | b | c | b | b | b | b | b | b | b | b | b | b | b | b | b | b | a | b | b | b | a | b | a | b | a | b | a | a | a | a | a | a | a | a | a | a | a | a | a | a | b | a | b | a | b | a | b | a | b | a | c | a | c | b | c | b |
| 2 | c | b | c | b | c | b | c | a | c | b | c | c | c | c | b | c | b | c | b | c | b | c | b | c | b | c | b | c | a | b | b | b | a | b | a | b | a | b | a | b | a | b | a | b | a | b | a | a | a | a | a | a | a | a | a | a | a | a | a | a | a | a | b | a | b | a | b | a | c | a | c | a | c | b | c | b | c | b |
| 3 | c | b | c | b | c | a | c | a | c | b | c | c | c | c | b | c | b | c | c | c | b | c | b | c | b | c | b | b | b | b | b | b | b | b | b | b | a | b | a | b | a | b | a | b | a | b | a | a | a | a | a | a | a | a | a | a | a | a | a | a | a | a | b | a | b | a | b | a | b | a | b | a | c | b | c | b | c | b |
| 4 | c | b | c | b | c | a | c | a | c | b | c | c | c | c | c | c | b | c | c | c | b | c | b | c | b | c | b | b | b | b | b | b | b | b | a | b | a | b | a | b | a | b | a | b | a | b | a | a | a | a | a | a | a | a | a | a | a | a | a | a | a | a | a | a | b | a | b | a | b | a | b | a | b | b | b | b | b | b |
| 5 | c | b | c | b | c | a | c | a | c | c | c | b | c | c | b | c | b | c | b | c | b | c | b | c | b | c | b | b | b | b | a | b | a | b | a | b | a | b | a | b | a | b | a | b | a | a | a | a | a | a | a | a | a | a | a | a | a | a | a | a | a | a | b | a | b | a | c | a | c | a | c | a | c | b | c | b | c | b |
| 6 | c | b | c | b | c | a | c | a | c | b | c | c | c | c | c | c | b | c | c | c | b | c | b | c | b | c | b | b | b | b | b | b | a | b | a | b | a | b | a | b | a | b | a | b | a | b | a | a | a | a | a | a | a | a | a | a | a | a | a | a | a | a | b | a | b | a | b | a | c | a | c | a | c | a | c | b | c | b |
As stated, most outlier detection methods focus solely on univariate series. The proposed system, METEOR (Serras 2018), covers the full pipeline from pre-processing onward and adopts a dynamic Bayesian network (DBN) modeling algorithm (Monteiro, Vinga, and Carvalho 2015). The latter encodes optimal inter- and intra-time-slice connectivity of transition networks capable of capturing conditional dependencies in MTS datasets. A sliding window mechanism is employed to gradually score each MTS transition given the model; simultaneously, complete MTS are evaluated. The developed system proved beneficial in capturing unusual data resulting from temporal contexts, being suitable for any MTS scenario. A widely accessible web application (Serras 2018) employing the complete system is made available jointly with a tutorial.
Outliers are considered to be observations suspected of not having been generated by the data’s main underlying processes. Such instances generally do not fit the trained model. The goal is to score portions as well as entire MTS. For each transition of a subject’s MTS, a window composed of its observed attributes is built, dependent on the order of the trained DBN. A transition score represents the log-likelihood (LL) of the observed window, computed using the network’s conditional probabilities. It is worth noting that a transition score does not represent the outlierness of a single time slice, but rather of its context. The score of an entire TS (subject) is computed as the average of all the transition scores from that subject. The more negative a score, the more anomalous it is.
With parameters markov_lag = 1 and previous_parents = 1, for example, each measurement at time t is associated with its immediate predecessor (the value at slice t-1).
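The window construction and scoring described above can be sketched as follows; this is an illustration under assumed conditional probabilities, not the system’s actual DBN code:

```r
# Sketch of the sliding-window scoring (markov_lag = 1 shown).
# Each window pairs slice t with its predecessor t-1.
build_windows <- function(ts, lag = 1) {
  lapply((lag + 1):length(ts), function(t) ts[(t - lag):t])
}

# Transition score: log-likelihood of the window under hypothetical
# conditional probabilities; subject score: mean of transition scores.
transition_score <- function(window_probs) sum(log(window_probs))
subject_score    <- function(transition_scores) mean(transition_scores)

w <- build_windows(c("a", "b", "c"))  # windows ("a","b") and ("b","c")
subject_score(c(transition_score(c(0.9, 0.8)),   # a likely window
                transition_score(c(0.1, 0.2))))  # an unlikely window
```

A window containing low-probability observations yields a strongly negative LL, dragging the subject’s average score down.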
Issues may occur in this phase due to the rJava package. This happens because the package is still under development (bugs are being fixed and features added) and because of the Java version in question. In most cases, simply re-running the script fixes the issue (it simply does, don’t ask me), so try re-running until it works. If problems persist, consider reinstalling rJava and updating Java (uninstalling the previous version), or use the online version of the web application https://jorgeserras.shinyapps.io/outlierdetection/ instead of running the code. The Dynamic Bayesian Network outlier detection system for multivariate time series is used. Check the main webpages https://jorgeserras.shinyapps.io/outlierdetection/ and https://jorgeserras.com/MTS_OutlierDetection/ for additional information. Don’t forget there is a tutorial video available!
NOTE: If the SAX discretization phase was skipped because the data are already discrete, please insert the discrete MTS dataset in .csv format into the folder “/resources/output_files” and change the name below from “/discrete_data.csv” to the name of your file. There is an example file in the folder named “/discrete_example_data.csv”.
### Both transitions and complete subjects are scored
### Name of the file where the discrete data is stored:
data_file_name <- "/discrete_data.csv"
DBN_output <- DBN_scoring(data_file_name, write_path, markov_lag = 1, previous_parents = 1, checkbox_stationary = TRUE, score_function = "ll")
DBN_transition_scores <- DBN_output[[1]]
DBN_subject_scores <- DBN_output[[2]]
# Join all the DBN scores (transitions and subjects) in one file/data.frame
DBN_scores <- DBN_transition_scores[,-1]
DBN_scores <- cbind(data.frame(subject_id=subject_id_data),DBN_scores,data.frame(subject_score=DBN_subject_scores$score))
# Restore the true subject_id values that are lost in the process
DBN_transition_scores$subject_id <- NULL
DBN_subject_scores$subject_id <- NULL
DBN_transition_scores <- cbind(data.frame(subject_id = subject_id_data, DBN_transition_scores))
DBN_subject_scores <- cbind(data.frame(subject_id = subject_id_data, DBN_subject_scores))
Visualization of the modeled DBN, which can be stationary or non-stationary. Nodes outlined in blue concern attributes at the present time slice (t) being scored. All other nodes concern attributes from previous slices (t-1, …) according to the structure of the transition networks. A node Vi[t] concerns the value of the i-th variable of the dataset at time slice t.
By default, the DBNs are modeled using a log-likelihood function; a parameter can be changed to use MDL instead. MDL constrains the training more strongly, which usually reduces the number of connections in the transition networks.
grViz("dbn_structure_dot") # Plot the DBN using file 'dbn_structure_dot'
The scoring results for transitions and entire subjects are observable below, with the subjects sorted by their subject scores (the lower the value, the more anomalous). The subjects to be displayed can be specified in the function’s argument. The results are written to the files DBN_scores_combined.csv and DBN_scores_combined_sort.csv, available in the folder ‘/resources/output_files/DBN’ together with all other DBN outputs.
kable(data.frame(DBN_sorted_scores[1:7,]),caption="DBN transition and subject scores")
| | subject_id | t_0 | t_1 | t_2 | t_3 | t_4 | t_5 | t_6 | t_7 | t_8 | t_9 | t_10 | t_11 | t_12 | t_13 | t_14 | t_15 | t_16 | t_17 | t_18 | t_19 | t_20 | t_21 | t_22 | t_23 | t_24 | t_25 | t_26 | t_27 | t_28 | t_29 | t_30 | t_31 | t_32 | t_33 | t_34 | t_35 | t_36 | t_37 | subject_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 193 | 192 | -0.0409921 | -0.0409921 | -0.0409921 | -0.0409921 | -6.9097573 | -8.006368 | -1.568219 | -8.0063676 | -7.6642942 | -8.0063676 | -8.0063676 | -3.2898548 | -8.006368 | -3.289855 | -1.100614 | -8.006368 | -3.289855 | -8.0063676 | -9.098998 | -1.100614 | -1.100614 | -0.0040040 | -0.0040040 | -0.0040040 | -0.0040040 | -0.0040040 | -8.0063676 | -1.910293 | -0.8487380 | -6.1742072 | -0.8487380 | -3.3489018 | -6.1742072 | -4.1533086 | -0.7638126 | -0.7638126 | -6.1742072 | -1.5682194 | -3.562528 |
| 74 | 73 | -0.0409921 | -0.0409921 | -0.0409921 | -0.0409921 | -11.9833502 | -1.568219 | -8.006368 | -1.1006143 | -1.1006143 | -8.0063676 | -3.2898548 | -8.0063676 | -3.289855 | -1.100614 | -1.100614 | -6.909757 | -1.100614 | -0.0040040 | -8.006368 | -3.289855 | -0.004004 | -6.9097573 | -1.1006143 | -0.0040040 | -6.9097573 | -6.9097573 | -6.9448128 | -13.815511 | -8.0063676 | -1.9102928 | -1.9102928 | -0.8487380 | -0.0443313 | -3.3489018 | -0.7638126 | -0.7638126 | -0.7638126 | -0.7638126 | -3.414468 |
| 199 | 198 | -0.6426953 | -0.0443313 | -0.0443313 | -6.9150290 | -3.4444222 | -8.006368 | -6.909757 | -0.8453989 | -0.0409921 | -0.0409921 | -5.0775970 | -0.8453989 | -11.983350 | -1.568219 | -8.006368 | -8.006368 | -1.910293 | -3.2898548 | -8.006368 | -3.289855 | -1.100614 | -8.0063676 | -3.2898548 | -1.1006143 | -1.1006143 | -1.1006143 | -1.1006143 | -6.909757 | -1.1006143 | -0.0040040 | -0.0040040 | -0.0040040 | -0.0040040 | -0.0040040 | -8.0063676 | -1.9102928 | -0.8487380 | -6.1742072 | -3.175981 |
| 68 | 67 | -1.4471021 | -0.0443313 | -0.0443313 | -3.3489018 | -0.7638126 | -8.006368 | -6.909757 | -6.9414737 | -0.0409921 | -5.0775970 | -0.8136825 | -0.8453989 | -11.983350 | -1.910293 | -1.910293 | -1.910293 | -1.910293 | -3.2898548 | -1.100614 | -8.006368 | -3.289855 | -1.1006143 | -1.1006143 | -8.0063676 | -3.2898548 | -1.1006143 | -1.1006143 | -8.006368 | -3.2898548 | -1.1006143 | -0.0040040 | -0.0040040 | -0.0040040 | -0.0040040 | -8.0063676 | -1.9102928 | -1.9102928 | -0.8487380 | -2.904794 |
| 167 | 166 | -0.0409921 | -0.0409921 | -0.0409921 | -0.0409921 | -3.4444222 | -8.006368 | -7.664294 | -6.1742072 | -3.2898548 | -8.0063676 | -1.9102928 | -3.2898548 | -8.006368 | -3.289855 | -8.006368 | -3.289855 | -1.100614 | -1.1006143 | -1.100614 | -1.100614 | -1.100614 | -6.9097573 | -1.1006143 | -0.0040040 | -0.0040040 | -0.0040040 | -8.0063676 | -2.193245 | -8.0063676 | -1.9102928 | -1.9102928 | -1.9102928 | -0.8487380 | -0.0443313 | -0.0443313 | -3.3489018 | -0.7638126 | -0.7638126 | -2.837324 |
| 108 | 107 | -0.0040040 | -0.0040040 | -0.0040040 | -0.0040040 | -6.9097573 | -1.100614 | -1.100614 | -8.0063676 | -1.9102928 | -3.2898548 | -8.0063676 | -3.2898548 | -8.006368 | -3.289855 | -8.006368 | -9.098998 | -6.909757 | -0.8136825 | -9.098998 | -6.909757 | -6.909757 | -0.8136825 | -0.8136825 | -0.8136825 | -0.8453989 | -0.0409921 | -0.0409921 | -3.444422 | -0.7638126 | -0.7638126 | -0.7638126 | -0.7638126 | -0.7638126 | -0.7638126 | -0.7638126 | -0.7638126 | -0.7638126 | -0.7638126 | -2.819059 |
| 94 | 93 | -0.6426953 | -0.0443313 | -0.0443313 | -3.3489018 | -0.7638126 | -8.006368 | -6.909757 | -0.8453989 | -0.0409921 | -5.0775970 | -0.8453989 | -5.0775970 | -7.719436 | -3.289855 | -1.100614 | -8.006368 | -3.289855 | -1.1006143 | -8.006368 | -3.289855 | -1.100614 | -1.1006143 | -1.1006143 | -1.1006143 | -1.1006143 | -1.1006143 | -1.1006143 | -1.100614 | -6.9097573 | -1.1006143 | -0.0040040 | -0.0040040 | -8.0063676 | -1.9102928 | -1.9102928 | -1.9102928 | -0.8487380 | -0.0443313 | -2.602730 |
A score-analysis phase is performed to ensure the disclosure of the boundary between the two classes, “Anomalous” and “Normal”. Two strategies are tested: Tukey’s strategy (Tukey 1977) and Gaussian Mixture Models (GMM) (McLachlan 2018) with 2 classes. The latter is performed using the package “mclust” (Fraley et al. 2017). More information is also available on the webpages of the developed web application. Both score-analysis approaches are applied to subject scores and to transition scores from the DBN approach.
Tukey’s strategy uses the Interquartile Range (IQR), the difference between the third and first quartiles of the score distribution, which measures the dispersion of the values. The boundary is computed as:
\[ Threshold = Q1 - (1.5 \cdot IQR), \quad IQR = Q3 - Q1,\] meaning that scores below the threshold are classified as anomalous. Tukey’s strategy assumes normal scores form a symmetric cluster, meaning outliers form a negative skew in the distribution.
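The computation above can be reproduced directly in base R; this is a minimal sketch, assuming the transition or subject scores are supplied as a numeric vector:

```r
# Tukey threshold on a vector of scores: Q1 - 1.5 * IQR, with
# IQR = Q3 - Q1; scores below it are classified as anomalous.
tukey_threshold <- function(scores, k = 1.5) {
  q <- quantile(scores, c(0.25, 0.75), names = FALSE)
  q[1] - k * (q[2] - q[1])
}
scores <- c(-1, -1.2, -0.9, -1.1, -8)
tukey_threshold(scores)            # -> -1.5
scores < tukey_threshold(scores)   # only the -8 score is flagged
```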
In the GMM score-analysis, disjoint score distributions are tackled. Scores are assumed to be generated from a finite mixture of Gaussian distributions with unknown parameters; the score distribution is modeled as a mixture of two Gaussian curves. Labeling each score then becomes a classification problem between two classes, C1 and C2, representing abnormality and normality respectively. The problem is defined as uncovering the value of
\[P(C_i|s) = \displaystyle \frac{P(s|C_i) \cdot P(C_i)}{P(s)}, \quad i \in \{1,2\}\] for each score value s, which can be obtained by Bayes’ rule. The threshold is defined as the boundary that best separates the two curves, which corresponds to the point of maximum uncertainty. This means that scores falling below the threshold have a higher likelihood of belonging to the “abnormal” class C1, being classified as outliers.
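In practice, mclust estimates the mixture parameters; the boundary search itself can be sketched in base R with hypothetical, fixed component parameters:

```r
# Sketch of the GMM boundary via Bayes' rule, with assumed (not
# estimated) parameters: C1 = abnormal, C2 = normal.
posterior_abnormal <- function(s, mu, sigma, prior) {
  lik <- prior * dnorm(s, mu, sigma)  # P(s|Ci) * P(Ci), i = 1, 2
  lik[1] / sum(lik)                   # P(C1|s) by Bayes' rule
}

# Threshold: score where P(C1|s) = 0.5, the point of maximum uncertainty.
grid <- seq(-8, 0, length.out = 2000)
uncertainty <- sapply(grid, function(s)
  abs(posterior_abnormal(s, mu = c(-6, -1), sigma = c(0.5, 0.3),
                         prior = c(0.1, 0.9)) - 0.5))
threshold <- grid[which.min(uncertainty)]
```

Scores below the resulting threshold have a posterior probability of abnormality above 0.5 and are labeled as outliers.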
Perform both score-analysis strategies (Tukey and GMM) for both subject and transition scores from the modeled DBNs.
Both thresholds from the score-analysis methods are now used to classify each score.
First, each transition score is evaluated with both score-analysis strategies using the corresponding thresholds. These are highlighted in red and observed in the next plots.
Recall that data marked as anomalous refer to the whole transition and not solely to the values of that specific slice. An entire anomalous transition is drawn in red, ending at the last time slice generated by that transition. Each anomalous transition contains lag+1 time slices due to the size of the window. In the present case, since the Markov lag is 1, an anomalous transition is represented by a red point at t (the first instant of the transition) and a red line connecting this first point t to t+1, the last slice of the transition. This last point is not marked in red, since that would indicate an outlier for the next transition (beginning in t+1) instead of the one which generated it.
For the results, please check the folder /resources/output_files/DBN to retrieve the outputs of each phase. It is advised to open the .csv files as text files, since some editors may change the appearance of the content.
ATTENTION: If PAA reduction was performed, the transition plots displaying outliers in red are not exact, since they display the input data before the reduction. Nevertheless, all computations and conclusions remain correct. Users should also remember that, having performed PAA, a transition t classified as anomalous does not correspond to the same transition of the input data.
Transitions classified as anomalous using Tukey’s strategy on data modeled by the DBN.
Tukey’s threshold: -4.7142438
Results of the final DBN transition classification. Parameters can be changed below to specify which subjects to observe.
kable(data.frame(results.DBN.transition.Tukey[1:7,]),caption="DBN final results using Tukey's threshold")
| subject_id | t_0 | t_1 | t_2 | t_3 | t_4 | t_5 | t_6 | t_7 | t_8 | t_9 | t_10 | t_11 | t_12 | t_13 | t_14 | t_15 | t_16 | t_17 | t_18 | t_19 | t_20 | t_21 | t_22 | t_23 | t_24 | t_25 | t_26 | t_27 | t_28 | t_29 | t_30 | t_31 | t_32 | t_33 | t_34 | t_35 | t_36 | t_37 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Score distribution and the final classification of each transition according to Tukey’s strategy.
Transitions classified as anomalous using the GMM strategy on data modeled by the DBN.
GMM threshold: -2.7415497
Results of the final DBN transition classification. Parameters can be changed below to specify which subjects to observe.
kable(data.frame(results.DBN.transition.GMM[1:7,]),caption="DBN final results using GMM's threshold")
| subject_id | t_0 | t_1 | t_2 | t_3 | t_4 | t_5 | t_6 | t_7 | t_8 | t_9 | t_10 | t_11 | t_12 | t_13 | t_14 | t_15 | t_16 | t_17 | t_18 | t_19 | t_20 | t_21 | t_22 | t_23 | t_24 | t_25 | t_26 | t_27 | t_28 | t_29 | t_30 | t_31 | t_32 | t_33 | t_34 | t_35 | t_36 | t_37 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 3 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 4 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 6 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
Score distribution and the final classification of each transition according to the GMM strategy.
Subjects possessing transitions classified as anomalous by METEOR with GMM’s score-analysis:
## [1] "0" "1" "2" "3" "4" "5" "6" "7" "8" "9" "10"
## [12] "11" "12" "13" "14" "15" "16" "17" "18" "19" "20" "21"
## [23] "22" "23" "24" "25" "26" "27" "28" "29" "30" "31" "32"
## [34] "33" "34" "35" "36" "37" "38" "39" "40" "41" "42" "43"
## [45] "44" "45" "46" "47" "48" "49" "50" "51" "52" "53" "54"
## [56] "55" "56" "57" "58" "59" "60" "61" "62" "63" "64" "65"
## [67] "66" "67" "68" "69" "70" "71" "72" "73" "74" "75" "76"
## [78] "77" "78" "79" "80" "81" "82" "83" "84" "85" "86" "87"
## [89] "88" "89" "90" "91" "92" "93" "94" "95" "96" "97" "98"
## [100] "99" "100" "101" "102" "103" "104" "105" "106" "107" "108" "109"
## [111] "110" "111" "112" "113" "114" "115" "116" "117" "118" "119" "120"
## [122] "121" "122" "123" "124" "125" "126" "127" "128" "129" "130" "131"
## [133] "132" "133" "134" "135" "136" "137" "138" "139" "140" "141" "142"
## [144] "143" "144" "145" "146" "147" "148" "149" "150" "151" "152" "153"
## [155] "154" "155" "156" "157" "158" "159" "160" "161" "162" "163" "164"
## [166] "165" "166" "167" "168" "169" "170" "171" "172" "173" "174" "175"
## [177] "176" "177" "178" "179" "180" "181" "182" "183" "184" "185" "186"
## [188] "187" "188" "189" "190" "191" "192" "193" "194" "195" "196" "197"
## [199] "198" "199"
Each subject score is evaluated with both score-analysis strategies using the corresponding thresholds.
The user can change the next code block to specify which subjects and which variable to plot, to compare a single-variable TS between subjects. This allows a comparison with the next figures, which only plot anomalous subjects. This way, different subjects can be analyzed as required.
Subjects classified as anomalous for both strategies are plotted next, for data modeled by METEOR.
Plot of the anomalous results using Tukey’s score-analysis; some plots can be altered by changing the input of the corresponding functions. The plots only display a single variable, as plotting all variables would make the subjects hard to distinguish.
Tukey’s threshold: -2.6931963
Subjects classified as anomalous by METEOR with Tukey’s score-analysis:
## [1] "67" "73" "107" "166" "192" "198"
Results of the final DBN subject classification. Parameters can be changed below to specify which subjects to observe.
kable(data.frame(results.DBN.subject.Tukey[1:7,]),caption="DBN final results using Tukey's threshold")
| subject_id | outlier |
|---|---|
| 0 | 0 |
| 1 | 0 |
| 2 | 0 |
| 3 | 0 |
| 4 | 0 |
| 5 | 0 |
| 6 | 0 |
Score distribution and the final classification of each subject according to Tukey’s strategy.
***
Plot of the anomalous results using GMM score-analysis, some plots can be altered by changing the input of the corresponding functions.
GMM threshold: -2.7108949
Subjects classified as anomalous by the DBN approach with the GMM score-analysis:
## [1] "67" "73" "107" "166" "192" "198"
Results of the final DBN subject classification. Parameters can be changed below to specify which subjects to observe.
kable(data.frame(results.DBN.subject.GMM[1:7,]),caption="DBN final results using GMM's threshold")
| subject_id | outlier |
|---|---|
| 0 | 0 |
| 1 | 0 |
| 2 | 0 |
| 3 | 0 |
| 4 | 0 |
| 5 | 0 |
| 6 | 0 |
Score distribution and the final classification of each subject according to the GMM strategy.
Subjects which are classified as anomalous and at the same time possess anomalous transitions, using GMM’s score-analysis, are presented below.
## [1] "67" "73" "107" "166" "192" "198"
Uncovering anomalies among MTS measurements is far from an easy task, not only due to the high instability/variability of values in each specific application, but also due to external or unknown factors which could have caused the anomalies. Most detections are usually caused by abrupt changes or abnormal values across multiple consecutive samples. It is worth noting that outlier subjects which also possess anomalous transitions are more likely to be interesting.
There is room for improvement and future research in enhancing the existing methods and extending the present conclusions to other datasets, particularly those with missing data.
Fraley, Chris, Adrian Raftery, Luca Scrucca, Thomas Brendan Murphy, and Michael Fop. 2017. “Package ‘mclust’.”
Lin, Jessica, Eamonn J. Keogh, Stefano Lonardi, and Bill Yuan-chi Chiu. 2003. “A Symbolic Representation of Time Series, with Implications for Streaming Algorithms.” In DMKD, 2–11. ACM.
McLachlan, Geoff. 2018. “Finite Mixture Models.” Annual Review of Statistics and Its Application 5 (1). Annual Reviews.
Monteiro, Jose L, Susana Vinga, and Alexandra M Carvalho. 2015. “Polynomial-Time Algorithm for Learning Optimal Tree-Augmented Dynamic Bayesian Networks.” In UAI, 622–31.
Serras, Jorge Luis Fuzeta. 2018. “Dynamic Bayesian Outlier Detection.” https://jorgeserras.shinyapps.io/outlierdetection/.
Tukey, John W. 1977. Exploratory Data Analysis. Vol. 2. Reading, Mass.: Addison-Wesley.